
    Fine-Grained Linguistic Soft Constraints on Statistical Natural Language Processing Models

    This dissertation focuses on the effective combination of data-driven natural language processing (NLP) approaches with linguistic knowledge sources based on manual text annotation or on word grouping according to semantic commonalities. I apply fine-grained linguistic soft constraints, of a syntactic or semantic nature, to statistical NLP models, evaluated in end-to-end state-of-the-art statistical machine translation (SMT) systems. The introduction of semantic soft constraints involves intrinsic evaluation on word-pair similarity ranking tasks, extension from words to phrases, application in a novel distributional paraphrase generation technique, and the introduction of a generalized framework in which these semantic and syntactic soft constraints can be viewed as instances and potentially combined. In many cases, fine granularity is key to the successful combination of these soft constraints.

    I show how to softly constrain SMT models by adding fine-grained weighted features, each preferring the translation of only a specific syntactic constituent; previous attempts using coarse-grained features had yielded negative results. I also show how to softly constrain corpus-based semantic models of words (“distributional profiles”) to effectively create word-sense-aware models, using the semantic word groupings found in a manually compiled thesaurus; previous attempts, which used hard constraints and produced aggregated, coarse-grained models, had yielded lower gains.

    A novel paraphrase generation technique incorporating these semantic soft constraints, based on the Distributional Hypothesis, is then also evaluated in an SMT system. Its main advantage over current “pivoting” techniques for paraphrasing is its independence from parallel texts, which are a limited resource. The evaluation augments translation models with paraphrase-based translation rules, where fine-grained scoring of these rules yields significantly higher gains. The augmentation includes a novel semantic reinforcement component: in many cases there are alternative paths for generating a paraphrase-based translation rule, and each such path reinforces a dedicated score for the “goodness” of the new rule. This augmented score is then used as a soft constraint, in a weighted log-linear feature, letting the translation model learn how much to “trust” the paraphrase-based translation rules. The work reported here is the first to use distributional semantic similarity measures to improve the performance of an end-to-end phrase-based SMT system. The unified framework for statistical NLP models with soft linguistic constraints enables, in principle, the combination of both semantic and syntactic constraints, and potentially other constraints too, in a single SMT model.
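    Since several of these contributions hinge on weighted log-linear features, a minimal sketch may help. This is the standard log-linear SMT decision rule in a generic formulation, not a formula quoted from the dissertation:

```latex
% Generic log-linear SMT decision rule: f is the source sentence, e a candidate
% translation, h_k the feature functions, and \lambda_k their tuned weights.
\hat{e} = \operatorname*{arg\,max}_{e} \sum_{k=1}^{K} \lambda_k \, h_k(e, f)
% A soft constraint is simply one more feature h_{K+1}(e, f), e.g. the
% reinforced "goodness" score of any paraphrase-based rules used in deriving e;
% tuning \lambda_{K+1} lets the model learn how much to trust those rules.
```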

    Domain-Independent Novel Event Discovery and Semi-Automatic Event Annotation


    Creative destruction in science

    Drawing on the concept of a gale of creative destruction in a capitalistic economy, we argue that initiatives to assess the robustness of findings in the organizational literature should aim to simultaneously test competing ideas operating in the same theoretical space. In other words, replication efforts should seek not just to support or question the original findings, but also to replace them with revised, stronger theories with greater explanatory power. Achieving this will typically require adding new measures, conditions, and subject populations to research designs, in order to carry out conceptual tests of multiple theories in addition to directly replicating the original findings. To illustrate the value of the creative destruction approach for theory pruning in organizational scholarship, we describe recent replication initiatives re-examining culture and work morality, working parents’ reasoning about day care options, and gender discrimination in hiring decisions.

    Significance statement: It is becoming increasingly clear that many, if not most, published research findings across scientific fields are not readily replicable when the same method is repeated. Although extremely valuable, failed replications risk leaving a theoretical void: they reduce confidence that the original theoretical prediction is true, but do not replace it with positive evidence in favor of an alternative theory. We introduce the creative destruction approach to replication, which combines theory pruning methods from the field of management with emerging best practices from the open science movement, with the aim of making replications as generative as possible. In effect, we advocate for a Replication 2.0 movement in which the goal shifts from checking on the reliability of past findings to actively engaging in competitive theory testing and theory building.

    Scientific transparency statement: The materials, code, and data for this article are posted publicly on the Open Science Framework, with links provided in the article.

    Transliteration normalization for Information Extraction and Machine Translation

    Foreign name transliterations typically include multiple spelling variants. These variants cause data sparseness and inconsistency problems, increase the Out-of-Vocabulary (OOV) rate, and present challenges for Machine Translation, Information Extraction, and other natural language processing (NLP) tasks. This work identifies and clusters name spelling variants using a Statistical Machine Translation method: word alignment. Variants are identified by being aligned to the same “pivot” name in another language (the source language in Machine Translation settings). Based on word-to-word translation and transliteration probabilities, as well as a string edit distance metric, names with similar spellings in the target language are clustered and then normalized to a canonical form. With this approach, tens of thousands of high-precision name transliteration spelling variants are extracted from sentence-aligned Arabic–English bilingual corpora, in both languages. When these normalized name spelling variants are applied to Information Extraction tasks, improvements over strong baseline systems are observed; when applied to Machine Translation tasks, a large potential for improvement is shown.
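    As a rough illustration of the clustering step only, here is a sketch under stated assumptions: the function names, the similarity measure, and the threshold are hypothetical, and the actual system also uses translation and transliteration probabilities beyond what is shown here.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def edit_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for a normalized edit distance."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster_variants(aligned_pairs, sim_threshold=0.7):
    """aligned_pairs: (pivot_name, target_spelling, count) triples obtained
    from word alignment; spellings aligned to the same pivot name are
    candidate variants of one name."""
    by_pivot = defaultdict(lambda: defaultdict(int))
    for pivot, spelling, count in aligned_pairs:
        by_pivot[pivot][spelling] += count

    normalization = {}
    for pivot, spellings in by_pivot.items():
        # Canonical form: the most frequent spelling aligned to this pivot.
        canonical = max(spellings, key=spellings.get)
        for s in spellings:
            # Keep only variants whose surface form is close to the canonical
            # one, filtering out unrelated words that share an alignment.
            if edit_similarity(s, canonical) >= sim_threshold:
                normalization[s] = canonical
    return normalization

# Example with hypothetical counts:
pairs = [("محمد", "Mohammed", 120), ("محمد", "Muhammad", 95), ("محمد", "Mohamed", 60)]
print(cluster_variants(pairs))
# -> all three spellings normalize to 'Mohammed', the most frequent variant
```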

    Soft Syntactic Constraints for Hierarchical Phrased-Based Translation

    In adding syntax to statistical MT, there is a tradeoff between taking advantage of linguistic analysis and allowing the model to exploit linguistically unmotivated mappings learned from parallel training data. A number of previous efforts have tackled this tradeoff by starting with a commitment to linguistically motivated analyses and then finding appropriate ways to soften that commitment. We present an approach that explores the tradeoff from the other direction: starting with a context-free translation model learned directly from aligned parallel text, and then adding soft constituent-level constraints based on parses of the source language. We obtain substantial improvements in performance for translation from Chinese and Arabic to English.
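    A minimal sketch of what a soft constituent-level constraint can look like at decoding time follows; the feature names and the exact boundary test are illustrative assumptions, not the paper's implementation.

```python
def constituency_features(span, constituents):
    """Soft constituent-level features for one hypothesized rule application.

    span: (i, j) inclusive source-side indices covered by the rule.
    constituents: (i, j, label) spans from a source-language parse.
    Returns a feature dict; the decoder adds weight * value per feature
    instead of pruning the hypothesis (a hard constraint would prune).
    """
    a, b = span
    feats = {}
    for (i, j, label) in constituents:
        if (i, j) == (a, b):
            feats[f"match_{label}"] = 1.0   # span exactly matches a constituent
        elif i < a <= j < b or a < i <= b < j:
            feats[f"cross_{label}"] = 1.0   # span crosses a constituent boundary
    return feats

# Example with a hypothetical parse of a six-word source sentence:
constituents = [(0, 2, "NP"), (3, 5, "VP"), (0, 5, "S")]
print(constituency_features((1, 4), constituents))
# -> cross_NP and cross_VP fire; the S span contains (1, 4), so nothing fires for S
```

    Keeping one feature per constituent label is what makes the scheme fine-grained: each label's penalty or reward gets its own tuned weight.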

    Thematic Fit Bits: Annotation Quality and Quantity Interplay for Event Participant Representation

    Modeling thematic fit (a verb–argument compositional semantics task) currently requires a very large burden of labeled data. We take a linguistically machine-annotated large corpus and replace corpus layers with output from higher-quality, more modern taggers. We compare the old and new corpus versions’ impact on a verb–argument fit modeling task, using a high-performing neural approach. We find that higher annotation quality dramatically reduces the data requirement while yielding better supervised predicate–argument classification. However, when applying the model to psycholinguistic tasks outside the training objective, we see clear gains at scale in only one of two thematic fit estimation tasks, and no clear gains on the other. We also see that performance improves with training size, though perhaps plateauing or even declining in one task. Last, we test the effect of role set size. All this suggests that the quality/quantity interplay is not all you need. We replicate previous studies while modifying certain role representation details, and set a new state of the art in event modeling using a fraction of the data. We make the new corpus version public.

    Comment: Published in LREC 2022; 8.5 pages before references, 11 pages total.

    Online large-margin training of syntactic and structural translation features

    Minimum-error-rate training (MERT) is a bottleneck for current development in statistical machine translation because it is limited in the number of weights it can reliably optimize. Building on the work of Watanabe et al., we explore the use of the MIRA algorithm of Crammer et al. as an alternative to MERT. We first show that, by parallel processing and exploiting more of the parse forest, we can obtain results using MIRA that match or surpass MERT in terms of both translation quality and computational cost. We then test the method on two classes of features that address deficiencies in the Hiero hierarchical phrase-based model: first, we simultaneously train a large number of Marton and Resnik’s soft syntactic constraints, and, second, we introduce a novel structural distortion model. In both cases we obtain significant improvements in translation performance. Optimizing them in combination, for a total of 56 feature weights, we improve performance by 2.6 BLEU on a subset of the NIST 2006 Arabic-English evaluation data.
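    For reference, this is the generic MIRA update for a single training example; it is a sketch of the underlying algorithm, and the paper adapts it to translation hypotheses with a BLEU-based loss, with details beyond this formula.

```latex
% Generic MIRA update: y^* is the oracle output, \hat{y} the model's current
% violating prediction, \Delta\Phi = \Phi(y^*) - \Phi(\hat{y}) the feature
% difference, \ell(\hat{y}) the loss of the prediction, and C a clip constant.
w_{t+1} = w_t + \alpha_t \, \Delta\Phi,
\qquad
\alpha_t = \min\!\left(C,\; \frac{\ell(\hat{y}) - w_t \cdot \Delta\Phi}{\lVert \Delta\Phi \rVert^2}\right)
% Each update is the smallest weight change that makes the margin over the
% violating hypothesis at least its loss, which is why MIRA scales to many
% more feature weights than MERT's line search.
```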

    Estimating semantic distance using soft semantic constraints in knowledge-source-corpus hybrid models

    Strictly corpus-based measures of semantic distance conflate co-occurrence information pertaining to the many possible senses of target words. We propose a corpus–thesaurus hybrid method that uses soft constraints to generate word-sense-aware distributional profiles (DPs) from coarser “concept DPs” (derived from a Roget-like thesaurus) and sense-unaware traditional word DPs (derived from raw text). Although it uses a knowledge source, the method is not vocabulary-limited: if the target word is not in the thesaurus, the method falls back gracefully on the word’s co-occurrence information. This allows the method to access valuable information encoded in a lexical resource, such as a thesaurus, while still being able to effectively handle domain-specific terms and named entities. Experiments on word-pair ranking by semantic distance show the new hybrid method to be superior to others.
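    A minimal sketch of the fallback behavior described above: the linear mixing weight alpha and the data structures are illustrative assumptions, not the paper's estimation method.

```python
def hybrid_profile(word, word_dps, concept_dps, thesaurus, alpha=0.5):
    """Word-sense-aware distributional profiles with graceful fallback.

    word_dps: word -> {context: weight}, from raw text (sense-unaware).
    concept_dps: concept -> {context: weight}, from thesaurus categories.
    thesaurus: word -> list of concepts (senses) the word belongs to.
    Returns one profile per sense, or the plain word DP if the word is
    outside the thesaurus vocabulary (e.g. a domain term or named entity).
    """
    word_dp = word_dps.get(word, {})
    concepts = thesaurus.get(word)
    if not concepts:                      # graceful fallback: unknown word
        return {word: word_dp}

    profiles = {}
    for c in concepts:
        c_dp = concept_dps.get(c, {})
        # Soft constraint: keep word-specific evidence, but re-weight it
        # toward contexts that the word's thesaurus concept also supports.
        contexts = set(word_dp) | set(c_dp)
        profiles[c] = {ctx: alpha * word_dp.get(ctx, 0.0)
                            + (1 - alpha) * c_dp.get(ctx, 0.0)
                       for ctx in contexts}
    return profiles
```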